Training (add tensorboard debug, and mAP Calculation) #206

Open
wants to merge 6 commits into
base: master

@KUASWoodyLIN

KUASWoodyLIN commented Aug 6, 2018

Provide useful debug information on TensorBoard:

  • mAP scalars
  • Images
  • Distributions
  • Histograms

@chenyuqing chenyuqing left a comment

/home/tim/anaconda3/bin/python /home/tim/workspaces_wx/keras-yolo3/voc_train_eval.py
/home/tim/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
/home/tim/anaconda3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
return f(*args, **kwds)
2018-09-22 14:30:49.472148: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-09-22 14:30:49.562588: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-09-22 14:30:49.562875: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce MX150 major: 6 minor: 1 memoryClockRate(GHz): 1.5315
pciBusID: 0000:01:00.0
totalMemory: 1.95GiB freeMemory: 1.36GiB
2018-09-22 14:30:49.562886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce MX150, pci bus id: 0000:01:00.0, compute capability: 6.1)
Create YOLOv3 model with 9 anchors and 2 classes.
Traceback (most recent call last):
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/common_shapes.py", line 686, in _call_cpp_shape_fn_impl
input_tensors_as_shapes, status)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension 0 in both shapes must be equal, but are 1 and 255 for 'Assign_360' (op: 'Assign') with input shapes: [1,1,1024,21], [255,1024,1,1].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/tim/workspaces_wx/keras-yolo3/voc_train_eval.py", line 529, in <module>
yolo = Yolo()
File "/home/tim/workspaces_wx/keras-yolo3/voc_train_eval.py", line 73, in __init__
self.yolo_model = self.create_model(yolo_weights_path='model_data/yolo_weights.h5')
File "/home/tim/workspaces_wx/keras-yolo3/voc_train_eval.py", line 117, in create_model
model_body.load_weights(yolo_weights_path, skip_mismatch=True)
File "/home/tim/anaconda3/lib/python3.6/site-packages/keras/engine/network.py", line 1161, in load_weights
f, self.layers, reshape=reshape)
File "/home/tim/anaconda3/lib/python3.6/site-packages/keras/engine/saving.py", line 928, in load_weights_from_hdf5_group
K.batch_set_value(weight_value_tuples)
File "/home/tim/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2435, in batch_set_value
assign_op = x.assign(assign_placeholder)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/variables.py", line 573, in assign
return state_ops.assign(self._variable, value, use_locking=use_locking)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/state_ops.py", line 276, in assign
validate_shape=validate_shape)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_state_ops.py", line 57, in assign
use_locking=use_locking, name=name)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2958, in create_op
set_shapes_for_outputs(ret)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2209, in set_shapes_for_outputs
shapes = shape_func(op)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2159, in call_with_requiring
return call_cpp_shape_fn(op, require_shape_fn=True)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/common_shapes.py", line 627, in call_cpp_shape_fn
require_shape_fn)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/common_shapes.py", line 691, in _call_cpp_shape_fn_impl
raise ValueError(err.message)
ValueError: Dimension 0 in both shapes must be equal, but are 1 and 255 for 'Assign_360' (op: 'Assign') with input shapes: [1,1,1024,21], [255,1024,1,1].

Process finished with exit code 1
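
The two channel counts in the error message (21 and 255) can be reproduced from the usual YOLOv3 head layout, where each detection layer outputs `anchors_per_scale * (num_classes + 5)` channels. The sketch below is only a worked calculation under that assumption; the function name is hypothetical and it does not explain why `skip_mismatch=True` failed to skip the layer.

```python
def yolo_head_filters(num_anchors, num_classes, num_scales=3):
    """Output channels of each YOLOv3 detection layer (assumed standard layout)."""
    anchors_per_scale = num_anchors // num_scales
    # Each anchor predicts 4 box coordinates, 1 objectness score,
    # and one score per class.
    return anchors_per_scale * (num_classes + 5)


print(yolo_head_filters(9, 2))   # 21  (the 2-class model being created)
print(yolo_head_filters(9, 80))  # 255 (weights trained on 80 classes)
```

So the model expects 21 output channels while the weight file holds 255. The different axis order of the two shapes is a separate discrepancy.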

@chenyuqing

Could you help me see what is wrong with my code? Thanks!

@KUASWoodyLIN
Author

Hi @chenyuqing

I tried train_v2.py and it works fine for me.
Maybe you should check your Keras backend configuration file.
I think your "image_data_format" setting is not correct.
My settings are:

{
    "epsilon": 1e-07,
    "image_data_format": "channels_last",
    "floatx": "float32",
    "backend": "tensorflow"
}

Make sure your settings are the same as mine.
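
A quick way to compare settings is to read the value straight from the configuration file. This is a minimal sketch, assuming the file is at the default Keras location `~/.keras/keras.json`; the helper name `read_image_data_format` is hypothetical.

```python
import json
import os


def read_image_data_format(config_path=None):
    """Return the "image_data_format" value from a Keras config file, or None."""
    if config_path is None:
        # Default location used by Keras (assumption: KERAS_HOME is not set).
        config_path = os.path.join(os.path.expanduser("~"), ".keras", "keras.json")
    with open(config_path) as f:
        return json.load(f).get("image_data_format")
```

Calling `read_image_data_format()` should return `"channels_last"` if the file matches the settings shown above.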

@Borda

Borda commented Aug 15, 2019

Hi, it seems that this repo has been inactive for a while... (more than a year 😟)
Would you consider contributing your changes to this fork: https://github.com/Borda/keras-yolo3 ?

tfukumori added a commit to tfukumori/keras-yolo3 that referenced this pull request Sep 22, 2020
@shocora

shocora commented Oct 28, 2020

Hi! I'm having trouble because I can't get training to run. Which part of train_v2.py should I change to run it?

@tfukumori

Hi! I'm having trouble because I can't get training to run. Which part of train_v2.py should I change to run it?

It seems that the versions of Python, TensorFlow, and Keras are important.

The repository lists the following versions:
https://github.com/qqwweee/keras-yolo3

Python 3.5.2
Keras 2.1.5
tensorflow 1.6.0

I have also verified that it works with the following environment:

Python 3.6
Keras 2.2.4
tensorflow 1.14.0
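
As a rough sketch, the two version combinations above can be written down as data and compared with whatever is installed. The names `KNOWN_GOOD`, `version_matches`, and `find_known_environment` are hypothetical, and the installed versions must be supplied by the caller (for example from `pip list`).

```python
# Version combinations reported in this thread.
KNOWN_GOOD = [
    {"python": "3.5.2", "keras": "2.1.5", "tensorflow": "1.6.0"},
    {"python": "3.6", "keras": "2.2.4", "tensorflow": "1.14.0"},
]


def version_matches(installed, expected):
    """True if installed equals expected or only adds components (3.6.8 matches 3.6)."""
    return installed == expected or installed.startswith(expected + ".")


def find_known_environment(installed):
    """Return the first known-good combination matching `installed`, else None."""
    for env in KNOWN_GOOD:
        if all(version_matches(installed.get(name, ""), wanted)
               for name, wanted in env.items()):
            return env
    return None


print(find_known_environment(
    {"python": "3.6.8", "keras": "2.2.4", "tensorflow": "1.14.0"}))
```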

@shocora

shocora commented Oct 29, 2020

Thank you for your reply. I matched the versions, but it still doesn't work.
Is there any place where I need to change a path other than lines 34, 35 and 41, 42 of train_v2.py? Also, is it okay if LOGS_PATH is empty for the first training run?

@tfukumori

When I first ran training, it didn't matter that the LOGS_PATH folder (yolo_logs by default) was empty.

I ran the following command:

python train_v2.py --yolo_train_file 2007_train.txt --yolo_val_file nano

2007_train.txt was created using voc_annotation.py, as described at https://github.com/qqwweee/keras-yolo3

To check the training results, I ran the following commands:

conda install tensorboard -y
tensorboard --logdir=<yolo_logs' full path> --host 0.0.0.0

Then open http://localhost:6006/ in a web browser.

@shocora

shocora commented Nov 10, 2020

Sorry for the late reply. When I ran the above command as written, I got the following error.
Also, what does nano specify?
"File "train_v2.py", line 89, in __init__
images_choose = [self.val_images[i] for i in np.random.randint(0, len(self.val_images), 50)]
AttributeError: 'Yolo' object has no attribute 'val_images'"

@tfukumori

tfukumori commented Nov 10, 2020

@shocora

What "nano" specifies

Specify "nano" if you do not want to provide a validation file, or if one does not exist.

As you can see around the following line of code, train_v2.py splits the data into training and validation sets in a 9:1 ratio, regardless of whether a validation file is specified.

if not self.val_annotation_path == 'nano':
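
A 9:1 split of annotation lines can be sketched as below. This is not the code from train_v2.py; the function name, the shuffling, and the fixed seed are assumptions made for illustration.

```python
import random


def split_annotations(lines, val_ratio=0.1, seed=0):
    """Shuffle annotation lines and split them into (train, validation) lists."""
    shuffled = list(lines)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    num_val = int(len(shuffled) * val_ratio)
    return shuffled[num_val:], shuffled[:num_val]


# Hypothetical annotation lines, in the "path box..." style of 2007_train.txt.
lines = ["img_%03d.jpg 10,20,30,40,0" % i for i in range(100)]
train, val = split_annotations(lines)
print(len(train), len(val))  # 90 10
```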

Why the error occurred

I think the following line of code was not executed because, for example, the folder for temporary files was left behind after an abnormal exit.

self.train_data, self.val_data, self.val_images = self.read_txt_file()

Procedure before executing the command

Before executing the command, delete any leftover temporary folder and move the existing results folder out of the way.

  • If a working folder (tmp_*) remains due to interruption, delete it.
  • If the results folder (yolo_logs) remains, delete, rename or move it.
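
The two cleanup steps above could be scripted roughly as follows. This is a sketch under assumptions: the helper name `clean_before_training` is hypothetical, the folders are assumed to live in the directory where train_v2.py is run, and the old results are renamed with a timestamp rather than deleted.

```python
import glob
import os
import shutil
import time


def clean_before_training(base_dir=".", logs_dir="yolo_logs"):
    """Delete leftover tmp_* folders and move an existing results folder aside."""
    # Remove working folders left behind by an interrupted run.
    for tmp in glob.glob(os.path.join(base_dir, "tmp_*")):
        if os.path.isdir(tmp):
            shutil.rmtree(tmp)
    # Keep old results by renaming the folder instead of deleting it.
    logs = os.path.join(base_dir, logs_dir)
    if os.path.isdir(logs):
        backup = logs + time.strftime("_%Y%m%d_%H%M%S")
        os.rename(logs, backup)
        return backup
    return None
```

Calling `clean_before_training()` from the repository folder before `python train_v2.py ...` returns the new name of the old results folder, or `None` if there was none.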

@shocora

shocora commented Nov 11, 2020

@tfukumori
Thank you.
I was able to start training.
However, it stopped with the following error.

"""
549 [Yolo loss: 36.249851]
Testing ...
[Yolo testing loss: 38.217436981201175]
Evaluate mAP
2020-11-11 11:59:25.702287: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-11-11 11:59:25.702377: E tensorflow/stream_executor/cuda/cuda_driver.cc:1032] could not synchronize on CUDA context: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure ::
2020-11-11 11:59:25.702561: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status:
"""

Also, how can I adjust the values on the horizontal axis of the training loss plot?

@tfukumori

tfukumori commented Nov 11, 2020

CUDA error

I'm not sure about the cause of the CUDA error.

From the error message, it is possible that the GPU does not have enough capacity, but I'm not certain.

If that is the cause, running on the CPU or reducing the batch size might solve the problem, at the cost of training speed.

https://jp.mathworks.com/matlabcentral/answers/427234-what-is-the-cause-of-cuda_error_launch_failed
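
One common way to force a CPU-only run with TensorFlow is to hide the GPUs from the process through the `CUDA_VISIBLE_DEVICES` environment variable. The sketch below assumes these lines are placed at the very top of the training script, before TensorFlow or Keras is imported.

```python
import os

# Hide all CUDA devices from this process so TensorFlow falls back to the CPU.
# This must run before TensorFlow (or Keras) is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
```

Training on the CPU will be much slower, so this is mainly useful for checking whether the failure is GPU-related.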

Adjusting the values on the horizontal axis of the training loss

If you mean changing the settings of the graph, I don't know.

If you mean the number of epochs, it seems to depend on the number of images and the batch size:

epoch = len(self.train_data) // self.step1_batch_size

epoch = len(self.train_data) // self.step2_batch_size
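
As a worked example of the integer division above (the sample counts and batch sizes are made-up numbers, and the function name is hypothetical):

```python
def iterations_per_pass(num_train_samples, batch_size):
    """Number of full batches in one pass over the training data."""
    return num_train_samples // batch_size


print(iterations_per_pass(1000, 32))  # 31
print(iterations_per_pass(1000, 8))   # 125
```

A smaller batch size gives more iterations per pass, which stretches the horizontal axis if the plot is indexed by iteration.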

@shocora

shocora commented Nov 16, 2020

@tfukumori
I was able to finish training in 3 days. Thanks.

Why is "tmp_pred_files" empty before and after training?

Also, when running yolo.py on full HD input, is it better to change the following numbers?

"model_image_size" : (416, 416),
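
For context on what this setting implies, the sketch below computes how large a full HD frame ends up inside the network input, assuming an aspect-preserving (letterbox) resize and the usual YOLOv3 constraint that both sides of the input are multiples of 32. The function names are hypothetical and this is not taken from yolo.py.

```python
def letterbox_size(image_size, model_size):
    """Return (new_w, new_h) after an aspect-preserving resize into model_size."""
    iw, ih = image_size
    w, h = model_size
    # Integer arithmetic avoids floating-point rounding surprises.
    if w * ih <= h * iw:
        # Width is the limiting side.
        return w, ih * w // iw
    # Height is the limiting side.
    return iw * h // ih, h


def is_valid_model_size(model_size):
    """YOLOv3 downsamples by 32, so both sides should be multiples of 32."""
    return all(side % 32 == 0 for side in model_size)


print(letterbox_size((1920, 1080), (416, 416)))  # (416, 234)
print(letterbox_size((1920, 1080), (608, 608)))  # (608, 342)
print(is_valid_model_size((416, 416)))           # True
```

Under these assumptions, a 1920x1080 frame occupies only 416x234 pixels of a 416x416 input, so a larger input size keeps more detail at the cost of speed and memory.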

@shocora

shocora commented Nov 30, 2020

I think mAP is usually between 0 and 1, but I get values of 1 or greater.
I would appreciate it if you could tell me the cause.

@tfukumori

I think mAP is usually between 0 and 1, but I get values of 1 or greater.
I would appreciate it if you could tell me the cause.

Maybe that's because the value is multiplied by 100, as you can see below.

mAP * 100


https://github.com/qqwweee/keras-yolo3/blob/f4a9c40f4615cdbb774942507ecad3af5f05c990/train_v2.py#L419

@shocora

shocora commented Dec 8, 2020

So this number is the result of multiplying by 100?
Also, which standard does the mAP calculation used here follow?

@tfukumori

So this number is the result of multiplying by 100?
Also, which standard does the mAP calculation used here follow?

I think this will be helpful.

You can find it here: https://qiita.com/mdo4nt6n/items/08e11426e2fac8433fed
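
For readers who want a concrete picture, the sketch below computes average precision as the area under the precision-recall curve with all-point interpolation, a method commonly used for PASCAL VOC-style evaluation. Whether train_v2.py uses exactly this variant is an assumption; the function name and the sample values are made up for illustration.

```python
def voc_ap(recall, precision):
    """Area under the precision-recall curve (all-point interpolation)."""
    mrec = [0.0] + list(recall) + [1.0]
    mpre = [0.0] + list(precision) + [0.0]
    # Make precision non-increasing when read from left to right.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum rectangle areas wherever recall changes.
    ap = 0.0
    for i in range(1, len(mrec)):
        if mrec[i] != mrec[i - 1]:
            ap += (mrec[i] - mrec[i - 1]) * mpre[i]
    return ap


ap = voc_ap([0.5, 1.0], [1.0, 0.5])
print(ap)        # 0.75
print(ap * 100)  # 75.0
```

mAP is the mean of this value over all classes, so it lies between 0 and 1 before scaling and between 0 and 100 after multiplying by 100.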
